Abstract. A conceptual framework for cluster analysis from the viewpoint
of p-adic geometry is introduced by describing the space of all dendrograms 
for n datapoints and relating it to the moduli space of p-adic Riemannian 
spheres with punctures using a method recently applied by Murtagh (2004b). 
This method embeds a dendrogram as a subtree into the Bruhat-Tits tree 
associated to the p-adic numbers, and goes back to Cornelissen et al. (2001) in 
p-adic geometry. After explaining the definitions, the concept of classifiers is 
discussed in the context of moduli spaces, and upper bounds for the number 
of hidden vertices in dendrograms are given. 



1. Introduction 

Dendrograms are ultrametric spaces, and ultrametricity is a pervasive property of
observational data. According to Murtagh (2004a), this offers computational advantages
and a well understood basis for developing data processing tools originating in
p-adic arithmetic. The aim of this article is to show that the foundations can
be laid much deeper by taking into account a natural object in p-adic geometry, 
namely the Bruhat-Tits tree. This locally finite, regular tree naturally contains 
the dendrograms as subtrees which are uniquely determined by assigning p-adic 
numbers to data. Hence, the classification task is conceptually reduced to finding
a suitable p-adic data encoding. Dragovich and Dragovich (2006) find a 5-adic 
encoding of DNA-sequences, and Bradley (2007) shows that strings have natural 
p-adic encodings. 

The geometric approach makes it possible to treat time-dependent data on an
equal footing with data that relate only to one instant of time by providing the concept
of a family of dendrograms. Probability distributions on families are then seen as a
convenient way of describing classifiers.

Our illustrative toy data set for this article is given as follows: 

Example 1.1. Consider the data set D = {0, 1, 3, 4, 12, 20, 32, 64} given by n = 8 
natural numbers. We want to hierarchically classify it with respect to the 2-adic 
norm as our distance function, as defined in Section 2.



2. A brief introduction to p-adic geometry

Euclidean geometry is modelled on the field $\mathbb{R}$ of real numbers which are often
represented as decimals, i.e. expanded in powers of the number $10^{-1}$:

$$x = \pm\sum_{\nu=m}^{\infty} a_\nu 10^{-\nu}, \quad a_\nu \in \{0,\dots,9\},\ m \in \mathbb{Z}.$$

In this way, $\mathbb{R}$ completes the field $\mathbb{Q}$ of rational numbers with respect to the absolute
norm

$$|x| = \begin{cases} x, & x \ge 0 \\ -x, & x < 0. \end{cases}$$

On the other hand, the p-adic norm on $\mathbb{Q}$,

$$|x|_p = \begin{cases} p^{-v_p(x)}, & x \ne 0 \\ 0, & x = 0, \end{cases}$$

is defined for $x = a_1/a_2$ by the difference $v_p(x) = v_p(a_1) - v_p(a_2) \in \mathbb{Z}$ in the
multiplicities with which numerator and denominator of $x$ are divisible by the prime number
$p$: $a_i = p^{v_p(a_i)} u_i$, with $u_i$ not divisible by $p$, $i = 1, 2$.

The p-adic norm satisfies the ultrametric triangle inequality 

$$|x + y|_p \le \max\{|x|_p, |y|_p\}.$$

Completing $\mathbb{Q}$ with respect to the p-adic norm yields the field $\mathbb{Q}_p$ of p-adic numbers
which is well known to consist of the power series

$$x = \sum_{\nu=m}^{\infty} a_\nu p^{\nu}, \quad a_\nu \in \{0,\dots,p-1\},\ m \in \mathbb{Z}. \tag{1}$$

Note that the p-adic expansion is in increasing powers of $p$, whereas in the decimal
expansion, it is the powers of $10^{-1}$ which increase arbitrarily. An introduction to
p-adic numbers is given e.g. in Gouvea (2003).

Example 2.1. For our toy data set D, we have $|0|_2 = 0$, $|1|_2 = |3|_2 = 1$, $|4|_2 =
|12|_2 = |20|_2 = 2^{-2}$, $|32|_2 = 2^{-5}$, $|64|_2 = 2^{-6}$, i.e. $|\cdot|_2$ is maximally 1 on D. Other
examples: $|3/2|_3 = |6/4|_3 = 3^{-1}$, $|20|_5 = 5^{-1}$, $|p^{-1}|_p = |p|_p^{-1} = p$.
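
The values in Example 2.1 can be checked mechanically. The following Python sketch is not part of the original text; the names `valuation` and `norm` are illustrative choices of ours. It computes $v_p$ from the multiplicities of $p$ in numerator and denominator, as in the definition above, and returns $|x|_p$ as an exact fraction.

```python
from fractions import Fraction

def valuation(x, p):
    """p-adic valuation v_p(x) of a nonzero rational x (hypothetical helper)."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("v_p(0) is +infinity")
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:      # multiplicity of p in the numerator
        num //= p
        v += 1
    while den % p == 0:      # multiplicity of p in the denominator
        den //= p
        v -= 1
    return v

def norm(x, p):
    """p-adic norm |x|_p = p^(-v_p(x)), with |0|_p = 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-valuation(x, p))

D = [0, 1, 3, 4, 12, 20, 32, 64]
norms = {x: norm(x, 2) for x in D}   # 2-adic norms of the toy data set
```

Working with `Fraction` keeps the norms exact, so that comparisons such as `norm(x, p) <= 1` are not affected by floating point rounding.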

Consider the unit disk $\mathbb{D} = \{x \in \mathbb{Q}_p \mid |x|_p \le 1\} = B_1(0)$. It consists of the
so-called p-adic integers, and is often denoted as $\mathbb{Z}_p$ when emphasizing its ring
structure, i.e. closure under addition, subtraction and multiplication. A p-adic
number $x$ lies in an arbitrary closed disk $B_{p^{-r}}(a) = \{x \in \mathbb{Q}_p \mid |x - a|_p \le p^{-r}\}$, where
$r \in \mathbb{Z}$, if and only if $x - a$ is divisible by $p^r$. This condition is equivalent to $x$ and $a$
having the first $r$ terms in common in their p-adic expansions (1). The possible radii
are all integer powers of $p$, so the disjoint disks $B_{p^{-1}}(0), B_{p^{-1}}(1), \dots, B_{p^{-1}}(p-1)$
are the maximal proper subdisks of $\mathbb{D}$, as they correspond to truncating the power
series (1) after the constant term. There is a unique minimal disk in which $\mathbb{D}$ is
contained properly, namely $B_p(0) = \{x \in \mathbb{Q}_p \mid |x|_p \le p\}$. These observations hold
true for arbitrary p-adic disks, i.e. any disk $B_{p^{-r}}(x)$, $x \in \mathbb{Q}_p$, is partitioned into
precisely $p$ maximal subdisks and lies properly in a unique minimal disk. Therefore,
if we define a graph $\mathscr{T}_{\mathbb{Q}_p}$ whose vertices are the p-adic disks, and edges are given
by minimal inclusion, then every vertex of $\mathscr{T}_{\mathbb{Q}_p}$ has precisely $p + 1$ outgoing edges.
In other words, $\mathscr{T}_{\mathbb{Q}_p}$ is a $(p+1)$-regular tree, and $p$ is the size of the residue field
$\mathbb{F}_p = \mathbb{Z}_p / p\mathbb{Z}_p$.
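
The partition of a disk into its $p$ maximal subdisks can be made concrete for integer data. The sketch below is our own illustration, restricted to integers and to radii $p^{-r}$ with $r \ge 0$; the function names `in_disk` and `subdisks` are assumptions and do not come from the original text. Membership is tested by divisibility of $x - a$ by $p^r$, and the subdisk containing $x$ is determined by the next digit of $x - a$.

```python
def valuation(n, p):
    """p-adic valuation of a nonzero integer n."""
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def in_disk(x, a, r, p):
    """True if the integer x lies in the closed disk of radius p^(-r) around a."""
    return x == a or valuation(x - a, p) >= r

def subdisks(points, a, r, p):
    """Distribute the points lying in B_{p^-r}(a) over its p maximal subdisks.

    The key is the digit d of (x - a) at p^r, i.e. x lies in the
    subdisk of radius p^-(r+1) around a + d * p^r. Assumes r >= 0.
    """
    parts = {}
    for x in points:
        if in_disk(x, a, r, p):
            digit = ((x - a) // p**r) % p
            parts.setdefault(digit, []).append(x)
    return parts

D = [0, 1, 3, 4, 12, 20, 32, 64]
unit_disk_split = subdisks(D, 0, 0, 2)   # split of D inside the unit disk
inner_split = subdisks(D, 0, 2, 2)       # split of D inside the disk of radius 1/4 around 0
```

For the toy data set, the unit disk separates the even from the odd numbers, and the disk of radius $2^{-2}$ around 0 separates $\{0, 32, 64\}$ from $\{4, 12, 20\}$.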

Definition 2.2. The tree $\mathscr{T}_{\mathbb{Q}_p}$ is called the Bruhat-Tits tree for $\mathbb{Q}_p$.

Remark 2.3. Definition 2.2 is not the usual way to define $\mathscr{T}_{\mathbb{Q}_p}$. The problem with
this ad-hoc definition is that it does not allow for any action of the projective linear
group $\mathrm{PGL}_2(\mathbb{Q}_p)$. A definition invariant under projective linear transformations
can be found e.g. in Herrlich (1980) or Bradley (2006).

An important observation is that any infinite descending chain

$$B_1 \supset B_2 \supset \dots \tag{2}$$

of strictly decreasing p-adic disks converges to a unique p-adic number $\{x\} = \bigcap_n B_n$.
A chain (2) defines a halfline in the Bruhat-Tits tree $\mathscr{T}_{\mathbb{Q}_p}$. Halflines differing only by
finitely many vertices are said to be equivalent, and the equivalence classes under
this equivalence relation are called ends. Hence the observation means that the
p-adic numbers correspond to ends of $\mathscr{T}_{\mathbb{Q}_p}$. There is a unique end $B_1 \subset B_2 \subset \dots$
coming from any strictly increasing sequence of disks. This end corresponds to the
point at infinity in the p-adic projective line $\mathbb{P}^1(\mathbb{Q}_p) = \mathbb{Q}_p \cup \{\infty\}$, whence the well
known fact:

Lemma 2.4. The ends of $\mathscr{T}_{\mathbb{Q}_p}$ are in one-to-one correspondence with the $\mathbb{Q}_p$-rational
points of the p-adic projective line $\mathbb{P}^1$, i.e. with the elements of $\mathbb{P}^1(\mathbb{Q}_p)$.

From the viewpoint of geometry, it is important to distinguish between the p-adic
projective line $\mathbb{P}^1$ as a p-adic manifold and its set $\mathbb{P}^1(\mathbb{Q}_p)$ of $\mathbb{Q}_p$-rational points, in
the same way as one distinguishes between the affine real line $\mathbb{A}^1$ as a real manifold
and its rational points $\mathbb{A}^1(\mathbb{Q}) = \mathbb{Q}$, for example. One reason for distinguishing
between a space and its points is:

Lemma 2.5. Endowed with the metric topology from $|\cdot|_p$, the topological space $\mathbb{Q}_p$
is totally disconnected.

The usual approaches towards defining more useful topologies on p-adic spaces 
are by introducing more points. Such an approach is the Berkovich topology, which 
we will very briefly describe. More details can be found in Berkovich (1990). 

The idea is to allow disks whose radii are arbitrary positive real numbers, not 
merely powers of p as before. Any strictly descending chain of such disks gives a 
point in the sense of Berkovich. For the p-adic line $\mathbb{P}^1$ this amounts to:

Theorem 2.6 (Berkovich). $\mathbb{P}^1$ is non-empty, compact, Hausdorff and arc-wise
connected. Every point of $\mathbb{P}^1 \setminus \{\infty\}$ corresponds to a descending sequence $B_1 \supset B_2 \supset \dots$
of p-adic disks such that $B = \bigcap B_n$ is one of the following:

(1) a point $x$ in $\mathbb{Q}_p$,

(2) a closed p-adic disk with radius $r \in |\mathbb{Q}_p|_p$,

(3) a closed p-adic disk with radius $r \notin |\mathbb{Q}_p|_p$,

(4) empty.

Points of types (2) to (4) are called generic, points of type (1) classical. We remark
that Berkovich's definition of points is technically somewhat different and allows one to
define more general p-adic spaces. Finally, the Bruhat-Tits tree $\mathscr{T}_{\mathbb{Q}_p}$ is recovered
inside $\mathbb{P}^1$:

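Theorem 2.7 (Berkovich). $\mathscr{T}_{\mathbb{Q}_p}$ is a retract of $\mathbb{P}^1 \setminus \mathbb{P}^1(\mathbb{Q}_p)$, i.e. there is a map
$\mathbb{P}^1 \setminus \mathbb{P}^1(\mathbb{Q}_p) \to \mathscr{T}_{\mathbb{Q}_p}$ whose restriction to $\mathscr{T}_{\mathbb{Q}_p}$ is the identity map on $\mathscr{T}_{\mathbb{Q}_p}$.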

3. p-adic dendrograms

Example 3.1. The 2-adic distances within D are encoded in Figure 1, where
$\mathrm{dist}(i, j) = 2^{-v_2(i,j)}$, if $v_2(i, j)$ is the corresponding entry in Figure 1, using $2^{-\infty} =
0$. Figure 2 is the dendrogram for D using $|\cdot|_2$: the distance between disjoint clusters
equals the distance between any of their representatives.

Figure 1. 2-adic valuations for D.

Figure 2. 2-adic dendrogram for $D \cup \{\infty\}$.
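
Since the figures themselves are not reproduced here, the valuations and distances of Example 3.1 can be recomputed directly. The following Python sketch is our own illustration, under the assumption that the entry $v_2(i, j)$ is the 2-adic valuation of $i - j$; the names `v2`, `val` and `dist` are hypothetical. It also checks the strong triangle inequality on D and the statement about distances between disjoint clusters for one pair of clusters.

```python
from fractions import Fraction
from itertools import combinations

def v2(n):
    """2-adic valuation of a nonzero integer n."""
    n, v = abs(n), 0
    while n % 2 == 0:
        n //= 2
        v += 1
    return v

D = [0, 1, 3, 4, 12, 20, 32, 64]

# Table of valuations v_2(i - j) for all pairs i < j of data points.
val = {(i, j): v2(i - j) for i, j in combinations(D, 2)}

def dist(i, j):
    """2-adic distance 2^(-v_2(i - j)), with dist(i, i) = 0."""
    if i == j:
        return Fraction(0)
    return Fraction(1, 2 ** v2(i - j))

# Strong (ultrametric) triangle inequality on all triples of D.
ultrametric = all(
    dist(x, z) <= max(dist(x, y), dist(y, z))
    for x in D for y in D for z in D
)

# Distance between the disjoint clusters {1, 3} and {4, 12, 20}:
# the same value for every choice of representatives.
cluster_dists = {dist(a, b) for a in [1, 3] for b in [4, 12, 20]}
```

The set `cluster_dists` contains a single value, which illustrates that the distance between two disjoint clusters does not depend on the chosen representatives.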

Let $X \subset \mathbb{P}^1(\mathbb{Q}_p)$ be a finite set. By Lemma 2.4, a point of $X$ can be considered
as an end in $\mathscr{T}_{\mathbb{Q}_p}$.

Definition 3.2. The smallest subtree $\mathscr{D}(X)$ of $\mathscr{T}_{\mathbb{Q}_p}$ whose ends are given by $X$ is
called the p-adic dendrogram for $X$.

Cornelissen et al. (2001) use p-adic dendrograms for studying p-adic symmetries,
cf. also Cornelissen and Kato (2005). We will ignore vertices in $\mathscr{D}(X)$ from which
precisely two edges emanate. Hence, for example, $\mathscr{D}(\{0, 1, \infty\})$ consists of a unique
vertex $v(0, 1, \infty)$ and three ends. The dendrogram for a set $X \subset \mathbb{N} \cup \{\infty\}$ containing
$\{0, 1, \infty\}$ is a rooted tree with root $v(0, 1, \infty)$.

Example 3.3. The 2-adic dendrogram in Figure 2 is nothing but $\mathscr{D}(X)$ for $X =
D \cup \{\infty\}$ and is in fact inspired by the first dendrogram of Murtagh (2004b). The
path from the top cluster to $x_i$ yields its binary representation which easily
translates into the 2-adic expansion: $0 = [0000000]_2$, $64 = [1000000]_2 = 2^6$, $32 =
[0100000]_2 = 2^5$, $4 = [0000100]_2 = 2^2$, $20 = [0010100]_2 = 2^2 + 2^4$, $12 = [0001100]_2 =
2^2 + 2^3$, $1 = [0000001]_2$, $3 = [0000011]_2 = 1 + 2^1$.
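
The passage from 2-adic digits to the dendrogram of Example 3.3 can be sketched in code. The following Python fragment is an illustration of ours and not the construction used in the cited works: it splits a finite set of natural numbers by successive p-adic digits and, following the convention above, skips levels at which no split occurs (vertices with precisely two edges). The point $\infty$ is not represented; the names `digits` and `dendrogram` and the nested-tuple output format `(level, children)` are assumptions made for this sketch.

```python
def digits(n, p, length):
    """First `length` p-adic digits of the natural number n, lowest power first."""
    out = []
    for _ in range(length):
        out.append(n % p)
        n //= p
    return out

def dendrogram(points, p, level=0):
    """Nested clusters of distinct natural numbers.

    A leaf is a data point; an inner vertex is a pair (level, children),
    where `level` is the exponent of p at which the cluster splits.
    """
    points = sorted(set(points))
    if len(points) == 1:
        return points[0]
    groups = {}
    for x in points:
        groups.setdefault((x // p**level) % p, []).append(x)
    if len(groups) == 1:
        # No split at this digit: no vertex is created, go one level deeper.
        return dendrogram(points, p, level + 1)
    return (level, [dendrogram(g, p, level + 1) for _, g in sorted(groups.items())])

D = [0, 1, 3, 4, 12, 20, 32, 64]
tree = dendrogram(D, 2)

# Binary representations as in Example 3.3 (highest power written first).
binary = {x: "".join(str(d) for d in reversed(digits(x, 2, 7))) for x in D}
```

For the toy data set the resulting tree is binary, with splits at the levels 0, 1, 2, 3, 4, 5 and 6; the top split at level 0 separates the cluster containing 0 from the cluster containing 1.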

Any encoding of some data set $M$ which assigns to each $x \in M$ a p-adic
representation of an integer including 0 and 1, yields a p-adic dendrogram $\mathscr{D}(M \cup \{\infty\})$
whose root is $v(0, 1, \infty)$, and any dendrogram for real data can be embedded in a
non-unique way into $\mathscr{T}_{\mathbb{Q}_p}$ as a p-adic dendrogram in such a way that $v(0, 1, \infty)$
represents the top cluster, if $p$ is large enough. In particular, any binary dendrogram is
a 2-adic dendrogram. However, a little algebra helps to find sufficiently large 2-adic
Bruhat-Tits trees $\mathscr{T}_K$ which allow embeddings of arbitrary dendrograms into $\mathscr{T}_K$.
In fact, by $K$ we mean a finite extension field of $\mathbb{Q}_p$. The p-adic norm $|\cdot|_p$ extends
uniquely to a norm $|\cdot|_K$ on $K$, for which it is a complete field, called a p-adic
number field. The integers of $K$ are again the unit disk $O_K = \{x \in K \mid |x|_K \le 1\}$,
and the role of the prime $p$ is played by a so-called uniformiser $\pi \in O_K$. It has
the property that $O_K / \pi O_K$ is a finite field with $q = p^f$ elements and contains $\mathbb{F}_p$.
Hence, if some dendrogram has a vertex with maximally $n \ge 2$ children, then we
need $K$ large enough such that $2^f \ge n$. This is possible by the results of number
theory. Restricting to the prime 2 not only avoids the need to switch the prime
number $p$ when a vertex has more than $p$ children, but the arithmetic in 2-adic
number fields is also known to be computationally simpler, especially as in our
case the so-called unramified extensions, i.e. those
where $\dim_{\mathbb{Q}_2} K = f$, are sufficient.
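
As a small worked instance of the condition $2^f \ge n$, the following Python sketch (our own, with the hypothetical name `residue_degree_needed`) returns the smallest $f \ge 1$ for which a residue field with $p^f$ elements offers at least $n$ branches below a vertex. It only evaluates the inequality and does not construct the extension field $K$.

```python
def residue_degree_needed(n, p=2):
    """Smallest f >= 1 with p**f >= n, for a vertex with n >= 2 children."""
    if n < 2:
        raise ValueError("a branching vertex has at least 2 children")
    f = 1
    while p ** f < n:
        f += 1
    return f

# Degrees needed over the prime 2 for vertices with 2, ..., 9 children.
degrees = {n: residue_degree_needed(n) for n in range(2, 10)}
```

For example, a vertex with 5 children requires $f = 3$ over the prime 2, since $2^2 < 5 \le 2^3$.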

Example 3.4. According to Bradley (2007), strings over a finite alphabet can be
encoded in an unramified extension of $\mathbb{Q}_p$, and hence be classified p-adically.

4. The space of dendrograms 

From now on, we will formulate everything for the case $K = \mathbb{Q}_p$, bearing in mind
that all results hold true for general p-adic number fields $K$. Let $S = \{x_1, \dots, x_n\} \subset
\mathbb{P}^1(\mathbb{Q}_p)$ consist of $n$ distinct classical points of $\mathbb{P}^1$ such that $x_1 = 0$, $x_2 = 1$, $x_3 =
\infty$. Similarly as in Theorem 2.7, the p-adic dendrogram $\mathscr{D}(S)$ is a retract of the
marked projective line $X = \mathbb{P}^1 \setminus S$. We call $\mathscr{D}(S)$ the skeleton of $X$. The space
of all projective lines with $n$ such markings is denoted by $\mathfrak{M}_n$, and the space of
corresponding p-adic dendrograms by $\mathfrak{D}_{n-1}$. $\mathfrak{M}_n$ is a p-adic space of dimension
$n - 3$, its skeleton $\mathfrak{D}_{n-1}$ is a CW-complex of real polyhedra whose cells of maximal
dimension $n - 3$ consist of the binary dendrograms. Neighbouring cells are passed
through by contracting bounded edges as the $n - 3$ "free" markings "move" about $\mathbb{P}^1$
without colliding. For example, $\mathfrak{M}_3$ is just a point corresponding to $\mathbb{P}^1 \setminus \{0, 1, \infty\}$.
$\mathfrak{M}_4$ has one free marking $\lambda$ which can be any $\mathbb{Q}_p$-rational point from $\mathbb{P}^1 \setminus \{0, 1, \infty\}$.
Hence, the skeleton $\mathfrak{D}_3$ is itself a binary dendrogram with precisely one vertex $v$
and three unbounded edges A, B, C (cf. Figure 3).

Figure 3. Dendrograms representing the different regions of $\mathfrak{D}_3$.

For $n \ge 3$ there are maps

$$f_{n+1}\colon \mathfrak{M}_{n+1} \to \mathfrak{M}_n, \qquad \phi_{n+1}\colon \mathfrak{D}_n \to \mathfrak{D}_{n-1},$$

which forget the $(n+1)$-st marking. Consider a $\mathbb{Q}_p$-rational point $x \in \mathfrak{M}_n$
corresponding to $\mathbb{P}^1 \setminus S$ with skeleton $d$. Its fibre $f_{n+1}^{-1}(x)$ corresponds to $\mathbb{P}^1 \setminus S'$ for all
possible $S'$ whose first $n$ entries constitute $S$. Hence, the extra marking $\lambda \in S' \setminus S$
can be taken arbitrarily from $\mathbb{P}^1(\mathbb{Q}_p) \setminus S$. In this way, the space $f_{n+1}^{-1}(x)$ can be
considered as $\mathbb{P}^1 \setminus S$, and $\phi_{n+1}^{-1}(d)$ as the p-adic dendrogram for $S$. What we have
seen is that taking fibres recovers the dendrograms corresponding to points in the
space $\mathfrak{D}_n$. Instead of fibres of points, one can take fibres of arbitrary subspaces:

Definition 4.1. A family of dendrograms with $n$ data points over a space $Y$ is a
map $Y \to \mathfrak{D}_n$ from some p-adic space $Y$ to $\mathfrak{D}_n$.

For example, take $Y = \{y_1, \dots, y_T\}$. Then a family $Y \to \mathfrak{D}_n$ is a time series of
$n$ collision-free particles, if $t \in \{1, \dots, T\}$ is interpreted as time variable. It is also
possible to take into account colliding particles by using compactifications of $\mathfrak{M}_n$
as described in Bradley (2006).

5. Distributions on dendrograms 

Given a dendrogram $\mathscr{D}$ for some data $S = \{x_1, \dots, x_n\}$, the idea of a classifier is
to incorporate a further datum $x \notin S$ into the classification scheme represented by
$\mathscr{D}$. Often this is done by assigning probabilities to the vertices of $\mathscr{D}$, depending on
$x$. The result is then a family of possible dendrograms for $S \cup \{x\}$ with a certain
probability distribution. It is clear that, in the case of p-adic dendrograms, this
family is nothing but $\phi_{n+1}^{-1}(d) \to \mathfrak{D}_n$, if $d \in \mathfrak{D}_{n-1}$ is the point representing $\mathscr{D}$. This
motivates the following definition:

Definition 5.1. A universal p-adic classifier $C$ for $n$ given points is a probability
distribution on $\mathfrak{M}_{n+1}$.

Here, we take on $\mathfrak{M}_{n+1}$ the Borel $\sigma$-algebra associated to the open sets of the
Berkovich topology. If $x \in \mathfrak{M}_n$ corresponds to $\mathbb{P}^1 \setminus S$, then $C$ induces a distribution
on $f_{n+1}^{-1}(x)$, hence (after renormalisation) a probability distribution on $\phi_{n+1}^{-1}(d)$,
where $d \in \mathfrak{D}_{n-1}$ is the point corresponding to the dendrogram $\mathscr{D}(S)$. The same
holds true for general families of dendrograms, e.g. time series of particles.

6. Hidden vertices 

A vertex $v$ in a p-adic dendrogram $\mathscr{D}$ is called hidden, if the class corresponding
to $v$ is not the top class and does not directly contain data points but is composed
of non-trivial subclasses. The subforest of $\mathscr{D}$ spanned by its hidden vertices will be
denoted by $\mathscr{D}^h$, and is called the hidden part of $\mathscr{D}$. The number $b_0^h$ of connected
components of $\mathscr{D}^h$ measures how the clusters corresponding to non-hidden vertices
are spread within the dendrogram $\mathscr{D}$. We give bounds for $b_0^h$ and the number $v^h$ of
hidden vertices, and refer to Bradley (2006) for the combinatorial proofs (Theorems
8.3 and 8.5).

Theorem 6.1. Let $\mathscr{D} \in \mathfrak{D}_n$. Then

v h< r l+l_ b h + 1 and b h<0LZA, 

where the latter bound is sharp. 

7. Conclusions 

Since ultrametricity is the natural property which allows classification and is
pervasive in observational data, the techniques of ultrametric analysis and p-adic
geometry are at one's disposal for identifying and exploiting ultrametricity. A p-adic
encoding of data provides a way to investigate arithmetic properties of the p-adic
numbers representing the data.

It is our aim to lay the geometric foundation for p-adic data encoding. From
the geometric point of view it is natural to perform the encoding of data by embedding the
underlying dendrogram into the Bruhat-Tits tree. In fact, the dendrogram and
its embedding are uniquely determined by the p-adic numbers representing the
data. To this end, we give an account of p-adic geometry in order to define p-adic
dendrograms as subtrees of the Bruhat-Tits tree.

In the next step we introduce the space of all dendrograms for a given number of
data points which, by p-adic geometry, is contained in the space $\mathfrak{M}_n$ of all marked
projective lines, an object appearing in the context of the classification of Riemann
surfaces. The advantages of considering the space of dendrograms are that it
permits a conceptual formulation of moving particles as families of dendrograms,
and that it has a simple geometry as a polyhedral complex. Also, assigning
distributions on $\mathfrak{M}_n$ allows for probabilistic incorporation of further data into a given
dendrogram. At the end, we give bounds for the numbers of hidden vertices and
hidden components of dendrograms.

What remains to be done is to computationally exploit the foundations laid in this
article by developing code along these lines and applying it to Fionn Murtagh's
task of finding ultrametricity in data.